【Day25】Web Scraping: Extracting JavaScript Dynamic Web Pages

2023 iThome Ironman Contest

DAY 25

Self-Challenge Group

Web Scraping series, part 25

15th Ironman Contest

2023-10-10 23:04:53

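Saving the dynamic Hahow course listing page
To analyze the content of a dynamic page, we can use Selenium to obtain the page content generated by JavaScript and save it as a static page. Modify the Python program Ch7_3b.py so that it saves the course listing page at https://hahow.in/courses instead.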

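from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

# Start Chrome using the chromedriver binary in the current directory
driver = webdriver.Chrome(service=Service("./chromedriver"))
driver.implicitly_wait(10)  # wait up to 10 seconds when locating elements
driver.get("https://hahow.in/courses")
print(driver.title)
# page_source is the page as currently rendered by the browser
soup = BeautifulSoup(driver.page_source, "lxml")
with open("hahow.html", "w", encoding="utf8") as fp:
    fp.write(soup.prettify())
print("Writing file hahow.html...")
driver.quit()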

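After loading the Hahow course page, the code above parses it with Beautiful Soup and saves the result as the HTML file hahow.html.
Scraping JavaScript dynamic pages with Selenium
We can write a Python program that scrapes content generated dynamically by JavaScript, in this case every course name on the https://hahow.in/courses page.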
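Once the page is saved as static HTML, it can be analyzed offline without starting a browser. The following is only a minimal sketch using the standard library's html.parser. It assumes the course names sit in <h4> elements whose class list contains "title" (the selector this article uses in the next section). The names TitleCollector and extract_titles, and the sample markup, are hypothetical and stand in for the real contents of hahow.html.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text of every <h4> whose class list contains "title"."""

    def __init__(self):
        super().__init__()
        self.titles = []     # finished course names
        self._buffer = None  # text pieces of the <h4> being read, or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h4" and "title" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h4" and self._buffer is not None:
            # Collapse the indentation that prettify() adds around the text
            text = " ".join("".join(self._buffer).split())
            if text:
                self.titles.append(text)
            self._buffer = None


def extract_titles(html):
    """Return the course names found in an HTML string."""
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return parser.titles


# Hypothetical sample standing in for the contents of hahow.html
sample = """
<div><h4 class="title">
   Python Basics
</h4><h4 class="title big">Web <b>Scraping</b> 101</h4><h4 class="sub">skip</h4></div>
"""
print(extract_titles(sample))
```

To run it against the saved file, read hahow.html with encoding="utf8" and pass the resulting string to extract_titles().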

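from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=Service("./chromedriver"))
driver.implicitly_wait(10)  # wait up to 10 seconds when locating elements
url = "https://hahow.in/courses"
driver.get(url)

# Selenium 4 replaced find_elements_by_css_selector() with find_elements(By.CSS_SELECTOR, ...)
items = driver.find_elements(By.CSS_SELECTOR, "h4.title")

for item in items:
    print(item.text)

driver.quit()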

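After loading the course page, the code above calls find_elements_by_css_selector() (written as find_elements(By.CSS_SELECTOR, ...) in Selenium 4.3 and later, where the older helper was removed) to collect the <h4> HTML elements that hold the course names, then prints each course name in a for/in loop.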
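The loop prints each element's text exactly as it comes back. On a dynamically rendered page, some matched elements may return empty text or repeat the same name, so the list may need a little cleanup before analysis. The following is only a sketch of one way to do that. clean_titles is a hypothetical helper, not part of Selenium, and it relies only on each element having a .text attribute. The stand-in objects below replace real WebElement objects so the example runs without a browser.

```python
from types import SimpleNamespace


def clean_titles(elements):
    """Return the non-empty, de-duplicated text of each element, in page order."""
    seen = set()
    titles = []
    for element in elements:
        text = element.text.strip()
        if text and text not in seen:
            seen.add(text)
            titles.append(text)
    return titles


# Stand-ins for Selenium WebElement objects; only .text is needed here
fake_items = [
    SimpleNamespace(text="  Python Basics "),
    SimpleNamespace(text=""),
    SimpleNamespace(text="Web Scraping 101"),
    SimpleNamespace(text="Python Basics"),
]
for title in clean_titles(fake_items):
    print(title)
```

In the program above, the same helper could be applied to the items list before printing.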